Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
                                            Some full text articles may not yet be available without a charge during the embargo (administrative interval).
                                        
                                        
                                        
                                            
                                                
                                             What is a DOI Number?
                                        
                                    
                                
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
- 
            Free, publicly-accessible full text available July 23, 2026
- 
            ABSTRACT Predicting the structure of ligands bound to proteins is a foundational problem in modern biotechnology and drug discovery, yet little is known about how to combine the predictions of protein‐ligand structure (poses) produced by the latest deep learning methods to identify the best poses and how to accurately estimate the binding affinity between a protein target and a list of ligand candidates. Further, a blind benchmarking and assessment of protein‐ligand structure and binding affinity prediction is necessary to ensure it generalizes well to new settings. Towards this end, we introduceMULTICOM_ligand, a deep learning‐based protein‐ligand structure and binding affinity prediction ensemble featuring structural consensus ranking for unsupervised pose ranking and a new deep generative flow matching model for joint structure and binding affinity prediction. Notably,MULTICOM_ligand ranked among the top‐5 ligand prediction methods in both protein‐ligand structure prediction and binding affinity prediction in the 16th Critical Assessment of Techniques for Structure Prediction (CASP16), demonstrating its efficacy and utility for real‐world drug discovery efforts. The source code for MULTICOM_ligand is freely available on GitHub.more » « lessFree, publicly-accessible full text available April 8, 2026
- 
            Abstract The oviduct is the site of fertilization and preimplantation embryo development in mammals. Evidence suggests that gametes alter oviductal gene expression. To delineate the adaptive interactions between the oviduct and gamete/embryo, we performed a multi-omics characterization of oviductal tissues utilizing bulk RNA-sequencing (RNA-seq), single-cell RNA-sequencing (scRNA-seq), and proteomics collected from distal and proximal at various stages after mating in mice. We observed robust region-specific transcriptional signatures. Specifically, the presence of sperm induces genes involved in pro-inflammatory responses in the proximal region at 0.5 days post-coitus (dpc). Genes involved in inflammatory responses were produced specifically by secretory epithelial cells in the oviduct. At 1.5 and 2.5 dpc, genes involved in pyruvate and glycolysis were enriched in the proximal region, potentially providing metabolic support for developing embryos. Abundant proteins in the oviductal fluid were differentially observed between naturally fertilized and superovulated samples. RNA-seq data were used to identify transcription factors predicted to influence protein abundance in the proteomic data via a novel machine learning model based on transformers of integrating transcriptomics and proteomics data. The transformers identified influential transcription factors and correlated predictive protein expressions in alignment with the in vivo-derived data. Lastly, we found some differences between inflammatory responses in sperm-exposed mouse oviducts compared to hydrosalpinx fallopian tubes from patients. In conclusion, our multi-omics characterization and subsequent in vivo confirmation of proteins/RNAs indicate that the oviduct is adaptive and responsive to the presence of sperm and embryos in a spatiotemporal manner.more » « lessFree, publicly-accessible full text available February 13, 2026
- 
            Abstract MotivationAs fewer than 1% of proteins have protein function information determined experimentally, computationally predicting the function of proteins is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress made in protein function prediction by the community in the last decade, the general accuracy of protein function prediction is still not high, particularly for rare function terms associated with few proteins in the protein function annotation database such as the UniProt. ResultsWe introduce TransFew, a new transformer model, to learn the representations of both protein sequences and function labels [Gene Ontology (GO) terms] to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences and uses a biological natural language model (BioBert) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definition and hierarchical relationships, which are combined together to predict protein function via the cross-attention. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy, but delivers a robust performance of predicting rare function terms with limited annotations by facilitating annotation transfer between GO terms. Availability and implementationhttps://github.com/BioinfoMachineLearning/TransFew.more » « less
- 
            Abstract Generative deep learning methods have recently been proposed for generating 3D molecules using equivariant graph neural networks (GNNs) within a denoising diffusion framework. However, such methods are unable to learn important geometric properties of 3D molecules, as they adopt molecule-agnostic and non-geometric GNNs as their 3D graph denoising networks, which notably hinders their ability to generate valid large 3D molecules. In this work, we address these gaps by introducing the Geometry-Complete Diffusion Model (GCDM) for 3D molecule generation, which outperforms existing 3D molecular diffusion models by significant margins across conditional and unconditional settings for the QM9 dataset and the larger GEOM-Drugs dataset, respectively. Importantly, we demonstrate that GCDM’s generative denoising process enables the model to generate a significant proportion of valid and energetically-stable large molecules at the scale of GEOM-Drugs, whereas previous methods fail to do so with the features they learn. Additionally, we show that extensions of GCDM can not only effectively design 3D molecules for specific protein pockets but can be repurposed to consistently optimize the geometry and chemical composition of existing 3D molecules for molecular stability and property specificity, demonstrating new versatility of molecular diffusion models. Code and data are freely available onGitHub.more » « less
- 
            Abstract Estimating the accuracy of protein structural models is a critical task in protein bioinformatics. The need for robust methods in the estimation of protein model accuracy (EMA) is prevalent in the field of protein structure prediction, where computationally‐predicted structures need to be screened rapidly for the reliability of the positions predicted for each of their amino acid residues and their overall quality. Current methods proposed for EMA are either coupled tightly to existing protein structure prediction methods or evaluate protein structures without sufficiently leveraging the rich, geometric information available in such structures to guide accuracy estimation. In this work, we propose a geometric message passing neural network referred to as the geometry‐complete perceptron network for protein structure EMA (GCPNet‐EMA), where we demonstrate through rigorous computational benchmarks that GCPNet‐EMA's accuracy estimations are 47% faster and more than 10% (6%) more correlated with ground‐truth measures of per‐residue (per‐target) structural accuracy compared to baseline state‐of‐the‐art methods for tertiary (multimer) structure EMA including AlphaFold 2. The source code and data for GCPNet‐EMA are available on GitHub, and a public web server implementation is freely available.more » « less
- 
            Abstract MotivationThe field of geometric deep learning has recently had a profound impact on several scientific domains such as protein structure prediction and design, leading to methodological advancements within and outside of the realm of traditional machine learning. Within this spirit, in this work, we introduce GCPNet, a new chirality-aware SE(3)-equivariant graph neural network designed for representation learning of 3D biomolecular graphs. We show that GCPNet, unlike previous representation learning methods for 3D biomolecules, is widely applicable to a variety of invariant or equivariant node-level, edge-level, and graph-level tasks on biomolecular structures while being able to (1) learn important chiral properties of 3D molecules and (2) detect external force fields. ResultsAcross four distinct molecular-geometric tasks, we demonstrate that GCPNet’s predictions (1) for protein–ligand binding affinity achieve a statistically significant correlation of 0.608, more than 5%, greater than current state-of-the-art methods; (2) for protein structure ranking achieve statistically significant target-local and dataset-global correlations of 0.616 and 0.871, respectively; (3) for Newtownian many-body systems modeling achieve a task-averaged mean squared error less than 0.01, more than 15% better than current methods; and (4) for molecular chirality recognition achieve a state-of-the-art prediction accuracy of 98.7%, better than any other machine learning method to date. Availability and implementationThe source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.more » « less
- 
            Since the 14th Critical Assessment of Techniques for Protein Structure Prediction (CASP14), AlphaFold2 has become the standard method for protein tertiary structure prediction. One remaining challenge is to further improve its prediction. We developed a new version of the MULTICOM system to sample diverse multiple sequence alignments (MSAs) and structural templates to improve the input for AlphaFold2 to generate structural models. The models are then ranked by both the pairwise model similarity and AlphaFold2 self-reported model quality score. The top ranked models are refined by a novel structure alignment-based refinement method powered by Foldseek. Moreover, for a monomer target that is a subunit of a protein assembly (complex), MULTICOM integrates tertiary and quaternary structure predictions to account for tertiary structural changes induced by protein-protein interaction. The system participated in the tertiary structure prediction in 2022 CASP15 experiment. Our server predictor MULTICOM_refine ranked 3rd among 47 CASP15 server predictors and our human predictor MULTICOM ranked 7th among all 132 human and server predictors. The average GDT-TS score and TM-score of the first structural models that MULTICOM_refine predicted for 94 CASP15 domains are ~0.80 and ~0.92, 9.6% and 8.2% higher than ~0.73 and 0.85 of the standard AlphaFold2 predictor respectively.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
